99 research outputs found
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Larger language models (LLMs) have taken the world by storm with their
massive multi-tasking capabilities simply by optimizing over a next-word
prediction objective. With the emergence of their properties and encoded
knowledge, the risk of LLMs producing harmful outputs increases, making them
unfit for scalable deployment for the public. In this work, we propose a new
safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that
even widely deployed models are susceptible to the Chain of Utterances-based
(CoU) prompting, jailbreaking closed source LLM-based systems such as GPT-4 and
ChatGPT to unethically respond to more than 65% and 73% of harmful queries. We
also demonstrate the consistency of the RED-EVAL across 8 open-source LLMs in
generating harmful responses in more than 86% of the red-teaming attempts.
Next, we propose RED-INSTRUCT--An approach for the safety alignment of LLMs. It
constitutes two phases: 1) HARMFULQA data collection: Leveraging CoU prompting,
we collect a dataset that consists of 1.9K harmful questions covering a wide
range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2)
SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the
safety alignment of LLMs by minimizing the negative log-likelihood over helpful
responses and penalizing over harmful responses by gradient accent over sample
loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely
aligned when evaluated on RED-EVAL and HHH benchmarks while preserving the
utility of the baseline models (TruthfulQA, MMLU, and BBH)
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Red-teaming has been a widely adopted way to evaluate the harmfulness of
Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to
make it act as a helpful agent disregarding the harmfulness of the query.
Existing methods are primarily based on input text-based red-teaming such as
adversarial prompts, low-resource prompts, or contextualized prompts to
condition the model in a way to bypass its safe behavior. Bypassing the
guardrails uncovers hidden harmful information and biases in the model that are
left untreated or newly introduced by its safety training. However,
prompt-based attacks fail to provide such a diagnosis owing to their low attack
success rate, and applicability to specific models. In this paper, we present a
new perspective on LLM safety research i.e., parametric red-teaming through
Unalignment. It simply (instruction) tunes the model parameters to break model
guardrails that are not deeply rooted in the model's behavior. Unalignment
using as few as 100 examples can significantly bypass commonly referred to as
CHATGPT, to the point where it responds with an 88% success rate to harmful
queries on two safety benchmark datasets. On open-source models such as
VICUNA-7B and LLAMA-2-CHAT 7B AND 13B, it shows an attack success rate of more
than 91%. On bias evaluations, Unalignment exposes inherent biases in
safety-aligned models such as CHATGPT and LLAMA- 2-CHAT where the model's
responses are strongly biased and opinionated 64% of the time.Comment: Under Revie
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model
Numerous studies in the field of music generation have demonstrated
impressive performance, yet virtually no models are able to directly generate
music to match accompanying videos. In this work, we develop a generative music
AI framework, Video2Music, that can match a provided video. We first curated a
unique collection of music videos. Then, we analysed the music videos to obtain
semantic, scene offset, motion, and emotion features. These distinct features
are then employed as guiding input to our music generation model. We transcribe
the audio files into MIDI and chords, and extract features such as note density
and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on
which we train a novel Affective Multimodal Transformer (AMT) model to generate
music given a video. This model includes a novel mechanism to enforce affective
similarity between video and music. Finally, post-processing is performed based
on a biGRU-based regression model to estimate note density and loudness based
on the video features. This ensures a dynamic rendering of the generated chords
with varying rhythm and volume. In a thorough experiment, we show that our
proposed framework can generate music that matches the video content in terms
of emotion. The musical quality, along with the quality of music-video matching
is confirmed in a user study. The proposed AMT model, along with the new
MuVi-Sync dataset, presents a promising step for the new task of music
generation for videos
KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks
We propose a new approach, Knowledge Distillation using Optimal Transport
(KNOT), to distill the natural language semantic knowledge from multiple
teacher networks to a student network. KNOT aims to train a (global) student
model by learning to minimize the optimal transport cost of its assigned
probability distribution over the labels to the weighted sum of probabilities
predicted by the (local) teacher models, under the constraints, that the
student model does not have access to teacher models' parameters or training
data. To evaluate the quality of knowledge transfer, we introduce a new metric,
Semantic Distance (SD), that measures semantic closeness between the predicted
and ground truth label distributions. The proposed method shows improvements in
the global model's SD performance over the baseline across three NLP tasks
while performing on par with Entropy-based distillation on standard accuracy
and F1 metrics. The implementation pertaining to this work is publicly
available at: https://github.com/declare-lab/KNOT.Comment: COLING 202
- …